Getting Started

About the Dataset

This dataset explores the characteristics of about 5000 white wines. Each wine is graded on a scale from 0 (very bad) to 10 (excellent). In addition to this grade, the dataset gives several attributes of those wines, such as acidity, residual sugar, and pH.

The goal of this analysis is to better understand what makes a good white wine, and how its rating can be explained by those characteristics.

Set-up

Univariate Plots Section

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

Our dataset consists of 4898 white wines evaluated on 12 different characteristics: 11 chemical characteristics of the wines (such as acidity, density, pH, etc.) and the overall quality of the wine as assessed by experts. The 13th column, X, is only a row index.

Quality

Let’s first assess the distribution of the quality of our wines:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000
##   x freq
## 1 3   20
## 2 4  163
## 3 5 1457
## 4 6 2198
## 5 7  880
## 6 8  175
## 7 9    5

The median value for wine quality is 6/10, with 2198 wines with that rating. Only 5 wines in our sample are ranked as 9/10 in quality, and none of them made it to 10/10. Similarly, no wine is ranked lower than 3 out of 10.
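The figures above can be reproduced from the frequency table alone. The following is a minimal sketch in base R that rebuilds the quality vector from the printed counts; the variable names are illustrative, and the original report may have used different functions or plotting packages.

```r
# Rebuild the quality vector from the frequency table printed above.
scores <- 3:9
counts <- c(20, 163, 1457, 2198, 880, 175, 5)
quality <- rep(scores, times = counts)

summary(quality)  # min, quartiles, median, mean, max
table(quality)    # number of wines per score

# Base-R bar chart of the counts (assumption: any bar chart will do here).
barplot(table(quality), xlab = "Quality score", ylab = "Number of wines")
```

Because the scores are integers, a bar chart of the counts reads more clearly than a histogram with arbitrary bins.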

Fixed Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

Most wines seem to have an acidity in between 6 and 8 g / dm^3, with outliers up to 14.2 g / dm^3. Let’s zoom in by limiting the axes in our graph:

The distribution is skewed right, with a peak in between 6.5 and 7 g / dm^3.

Let’s visualize those outliers along with the general distribution of the data in a boxplot:
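As a rough check on what a boxplot flags as an outlier, the usual 1.5 x IQR rule can be applied to the quartiles printed above. This is a sketch under the assumption that the plot uses R's default whisker rule; note that R's boxplot works from hinges, which can differ slightly from the quartiles reported by summary.

```r
# Quartiles of fixed acidity taken from the summary above (g / dm^3).
q1 <- 6.3
q3 <- 7.3
iqr <- q3 - q1                 # interquartile range

lower_fence <- q1 - 1.5 * iqr  # values below this are drawn as outlier points
upper_fence <- q3 + 1.5 * iqr  # values above this are drawn as outlier points

c(lower = lower_fence, upper = upper_fence)

# The maximum observed value lies well beyond the upper fence.
max_value <- 14.2
max_value > upper_fence
```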

Volatile Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

Most volatile acidity levels fall between 0.2 and 0.4 g / dm^3, with some outliers above 0.9 g / dm^3. Overall the distribution is still fairly symmetric, with the median and the mean quite close (0.26 and 0.2782 respectively). If we zoom in:

We do see a peak in our distribution at around 0.25 g / dm^3.

Citric Acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

The distribution here is also bell-shaped, with the median and the mean close to one another (0.32 g / dm^3 for the median vs 0.3342 g / dm^3 for the mean). The values range from 0 to 1.66 g / dm^3. In the graph we can see a peak at about 0.5 g / dm^3. Let’s zoom in:

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

It is interesting to see that over 200 wines show a citric acid amount of exactly 0.49 g / dm^3, while neighbouring amounts (0.46 to 0.52 for instance) have much lower wine counts. The table above was meant to show the metrics for this subset of wines with citric acid of 0.49 g / dm^3, but it is identical to the full-dataset summary (X still runs from 1 to 4898), so the filter does not appear to have been applied. The quality range of 3 to 9 out of 10 shown there is therefore that of the whole dataset, and it cannot confirm whether this particularity has an impact on quality ratings.
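To look only at the wines sitting on the 0.49 g / dm^3 spike, the data frame has to be filtered before it is summarised. The sketch below uses a small made-up data frame (wines_demo is hypothetical, not the real data) and compares with a tolerance, which is a cautious habit when matching decimal values.

```r
# Hypothetical stand-in for the wines data frame; the values are made up.
wines_demo <- data.frame(
  citric.acid = c(0.36, 0.49, 0.49, 0.32, 0.49, 0.74),
  quality     = c(6, 5, 7, 6, 3, 8)
)

# Match 0.49 with a small tolerance instead of a strict == comparison.
is_049 <- abs(wines_demo$citric.acid - 0.49) < 1e-8
subset_049 <- wines_demo[is_049, ]

nrow(subset_049)             # how many wines are in the subset
summary(subset_049$quality)  # quality summary for the subset only
```

A quick sanity check is that the row count of the subset is smaller than that of the full data frame.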

From the notes on the dataset, we know that “found in small quantities, citric acid can add ‘freshness’ and flavor to wines”. That seems to indicate a different way of looking at it than the two previous acidity measures; in the bivariate analysis section, we could try here to estimate what is a good “small quantity” by comparing citric acid concentrations to wine quality.

Residual Sugar

After studying acidity, let’s now look into sugar amounts in our wine:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

Interestingly, a very large number of our wines seem to have rather low amounts of residual sugar. The mean of our dataset is 6.391 g / dm^3 while the maximum observed value is 65.8 g / dm^3. Let’s zoom in on our dataset with residual sugar concentrations between 0 and 20 g / dm^3:

We see here that the distribution is long-tailed. Let’s see what we get using the log transform:

The distribution now looks bimodal.
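The effect of a log transform on a long-tailed variable can be sketched with simulated data. The values below are drawn from a log-normal distribution and are not the wine measurements; the skewness helper is a simple illustrative definition. Simulated log-normal data is unimodal, so this only illustrates how the tail is compressed, not the bimodality seen in the real variable.

```r
set.seed(42)

# Simulated long-tailed values; NOT the wine data.
sugar_demo <- rlnorm(5000, meanlog = 1.5, sdlog = 0.9)

# Simple sample skewness: third central moment over the cubed standard deviation.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(sugar_demo)         # clearly positive: long right tail
skew_log <- skewness(log10(sugar_demo))  # close to zero after the transform

# Side-by-side histograms before and after the transform.
op <- par(mfrow = c(1, 2))
hist(sugar_demo, breaks = 50, main = "Raw", xlab = "value")
hist(log10(sugar_demo), breaks = 50, main = "log10", xlab = "log10(value)")
par(op)
```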

My initial assumption would be that, for white wines, a higher amount of residual sugar could mean better quality ratings, even though we do not know what type of white wines those Vinho Verde are.

Chlorides

Chlorides refer to the amount of salt in the wine:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

The distribution here is very long tailed. Let’s apply a log transform:

The log transform compresses the long right tail, and the chlorides distribution now appears close to normal.

Free Sulfur Dioxide

Free Sulfur Dioxide “prevents microbial growth and the oxidation of wine”.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

The distribution and the boxplot show an outlier at 289 mg / dm^3. Let’s remove it by limiting the axes:

The distribution now appears very close to normal, although slightly skewed to the right.

Total Sulfur Dioxide

Total sulfur dioxide is the sum of the amounts of free and bound forms of sulfur dioxide.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

We see a symmetric, bell-shaped distribution for our total sample. There is also an outlier at 440 mg / dm^3, but this is probably caused by the outlier in the free sulfur dioxide distribution.

Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Density also presents an outlier at about 1.04. Let’s remove it and zoom in:

Most of the wines seem to have a density in between 0.99 and 1.00, meaning close to the density of water.

pH

According to the notes, pH is measured on a scale from “0 (very acidic) to 14 (very basic)”.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

In our dataset, the wines have a pH between 2.72 and 3.82. The distribution is symmetric and bell-shaped.

Sulphates

According to the notes attached to our initial dataset, sulphates act as a “wine additive which can contribute to sulfur dioxide gas (S02) levels, wich acts as an antimicrobial and antioxidant”.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

The distribution is bell-shaped with a somewhat long right tail. The range of values is quite wide, going from 0.22 to 1.08.

Using a log transform brings the distribution closer to normal by reducing the right skew.

Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

The boxplot shows an interesting distribution here, with a large share of data points that seem to lie outside the box, that is, outside the range between the first and third quartiles.

Applying a log10 transform to the variable does not make the distribution look normal. We’ll investigate this variable further in the next sections.

Univariate Analysis

What is the structure of your dataset?

Our dataset consists of 4898 white wines evaluated on 12 different characteristics. 11 of those characteristics are quantitative chemical observations for the wines (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol), while the last one is their quality rating as assessed by experts (on a scale from 0 to 10).

What is/are the main feature(s) of interest in your dataset?

Given that this analysis is aiming to understand which variables are the biggest drivers in wine quality, quality is definitely the main feature of interest.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

While all 11 variables may have a role to play in the final quality of the wine, the univariate plots section helped me identify the key features I’ll focus on: acidity, residual sugar, chlorides and sulfur dioxide. I may also include alcohol if I see anything of interest.

I’m also interested in how those variables behave relative to one another: for instance, I would assume that the wines with the highest proportion of residual sugar have the lowest acidity amounts; similarly, residual sugar may also be related to their chlorides amounts (chlorides being salt).

Did you create any new variables from existing variables in the dataset?

I did not create any variables in this section. In the rest of the analysis, I may create additional variables to combine the acidity metrics (fixed and volatile) into one.
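One possible way to combine the two acidity metrics is a simple sum, since both are reported in g / dm^3. The sketch below uses the first three rows shown in the structure output above; the column name total.acidity is hypothetical and was not part of the original analysis.

```r
# First three rows of the two acidity columns, copied from the str() output above.
wines_head <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.30, 0.28)
)

# Hypothetical combined metric: the sum of fixed and volatile acidity.
wines_head$total.acidity <- wines_head$fixed.acidity + wines_head$volatile.acidity

wines_head
```

Whether a plain sum is chemically meaningful is an assumption; the two acids differ, so a combined variable would need to be interpreted with care.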

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Most of the distributions were bell-shaped and normal-looking, even though some of them were slightly skewed to the left or right. I applied a log transform to some variables: residual sugar, chlorides, sulphates, and alcohol. The only distributions that remained non-normal after the log transform were residual sugar, whose histogram looked bimodal, and alcohol.

Other adjustments that I regularly made were changing the bin width of the histograms, and “zooming in” on the distribution by limiting the x-axis to remove outliers.

Bivariate Plots Section

Correlation - Overview

Let’s first remove the variable “X” from our dataset, as it is only a row index and does not add any value to the analysis: the wines dataset now has 12 variables instead of 13.

Now we can use ggcorr to look at the variables’ correlation coefficients:

(A PNG image of this output is included in the submission)

Interestingly, the variables most strongly correlated with quality are chlorides (-0.21), density (-0.307) and alcohol (0.436). While quality is negatively correlated with chlorides and density, it has a moderate positive correlation with alcohol, the strongest of the three. I’ll investigate this further in the next section, “Biggest drivers of Quality”.

Alcohol itself is strongly negatively correlated with residual sugar (-0.451), total sulfur dioxide (-0.449) and density (-0.78). Given that alcohol seems from this chart the biggest driver in quality, I’ll also investigate the relationship between alcohol and those three variables in a second section, “Biggest drivers of Alcohol”.

Regarding the correlations between the other variables in the dataset: contrary to expectations, residual sugar has low correlation levels with fixed acidity (0.089) and chlorides (0.0887). However, it is most strongly linked to density (0.839) and total sulfur dioxide (0.401). The volatile and fixed acidity variables do not have any strong correlations with any other variables in the dataset, not even with each other.
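The coefficients quoted above come from a correlation matrix. As a minimal sketch of that step, the code below drops an index column and computes Pearson correlations with base R's cor on simulated data. The data frame wines_demo and its coefficients are made up for illustration, and cor stands in for the ggcorr plot used in the report.

```r
set.seed(1)
n <- 500

# Simulated stand-in with a built-in negative alcohol/density link; NOT the wine data.
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
density <- 1.02 - 0.0025 * alcohol + rnorm(n, sd = 0.001)
quality <- round(3 + 0.3 * alcohol + rnorm(n, sd = 0.8))
wines_demo <- data.frame(X = seq_len(n), alcohol, density, quality)

# Drop the row index so it does not appear in the correlation matrix.
wines_demo$X <- NULL

# Pearson correlation matrix of the remaining numeric columns.
corr <- cor(wines_demo)
round(corr, 3)

# Correlations with quality only, sorted from most negative to most positive.
sort(corr["quality", colnames(corr) != "quality"])
```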

Biggest drivers of Quality

Quality and Chlorides

Let’s plot together quality and chlorides on different boxplots according to quality ratings:

To gain further visibility, I’ll create a new column, “rating”, which will take the value “low” if quality <= 4, “high” if quality >= 8, and “average” for everything in between. This is very similar to our two subsets of the previous section, but will allow us to have everything in one dataframe.

## 
## average    high     low 
##    4535     180     183

I have the same numbers of high and low quality wines as before.
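The rating rule described above can be sketched with cut, using the quality counts printed in the univariate section to rebuild the scores. This is one possible implementation; the original code may have used a different approach such as ifelse.

```r
# Rebuild the quality scores from the frequency table in the Quality section.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# low: quality <= 4, average: 5 to 7, high: quality >= 8
rating <- cut(quality,
              breaks = c(-Inf, 4, 7, Inf),
              labels = c("low", "average", "high"))

table(rating)
# low = 183, average = 4535, high = 180
```

Using cut gives an ordered set of levels (low, average, high), which keeps boxplots in a natural order rather than the alphabetical one shown in the output above.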

Next, I’ll again create a scatterplot and three boxplots according to the wines’ quality ratings:

## wines$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04589 0.05000 0.34600 
## -------------------------------------------------------- 
## wines$rating: high
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01400 0.03000 0.03550 0.03801 0.04400 0.12100 
## -------------------------------------------------------- 
## wines$rating: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01300 0.03750 0.04600 0.05056 0.05400 0.29000

From the boxplots, we can see that while there is no visible difference between the average and the low quality boxplots, the high quality one presents a lower chlorides level.
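Per-group summaries of this shape resemble what base R's by produces. A minimal sketch on a tiny made-up data frame follows; wines_demo and its values are hypothetical and chosen only to show the mechanics.

```r
# Hypothetical stand-in; the chlorides values are made up for illustration.
wines_demo <- data.frame(
  chlorides = c(0.045, 0.049, 0.050, 0.030, 0.036, 0.046, 0.054),
  rating    = factor(c("average", "average", "average",
                       "high", "high", "low", "low"))
)

# Summary of chlorides within each rating group.
by(wines_demo$chlorides, wines_demo$rating, summary)

# Group medians as a named vector, convenient for comparisons.
medians <- tapply(wines_demo$chlorides, wines_demo$rating, median)
medians

# One boxplot per rating group.
boxplot(chlorides ~ rating, data = wines_demo, ylab = "Chlorides (g / dm^3)")
```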

Quality and Density

## wines$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9918  0.9938  0.9941  0.9962  1.0390 
## -------------------------------------------------------- 
## wines$rating: high
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9903  0.9916  0.9922  0.9935  1.0010 
## -------------------------------------------------------- 
## wines$rating: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9892  0.9926  0.9941  0.9943  0.9960  1.0000

Here again, the high quality wines tend to have lower densities than the average or low quality wines. This is interesting because we also saw that density is strongly correlated with residual sugar amounts (0.839), so lower sugar levels go together with lower density. I’d say that the correlation between density and quality is weaker evidence of causality than the links between sugar and quality or alcohol and quality may be. Low density seems to me a by-product of the characteristics found in high quality wines rather than a cause of high quality.

Quality and Alcohol

## wines$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.40   10.30   10.48   11.30   14.20 
## -------------------------------------------------------- 
## wines$rating: high
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.65   12.60   14.00 
## -------------------------------------------------------- 
## wines$rating: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.40   10.10   10.17   10.80   13.50

Here the difference between high quality wines and the other ones is the strongest visually: high quality wines clearly have higher alcohol levels than average ones, which also have higher levels than the low quality wines. Because this is the strongest correlation found here in regard to quality, let’s analyze further what seems to be driving high levels of alcohol in wines.

Biggest drivers of Alcohol

Alcohol and Residual Sugar

## [1] -0.4506312

The correlation between alcohol and residual sugar is -0.4506, illustrated by the regression line in the scatter plot above. I removed the only sweet wine in our dataset from this visualization as it is an outlier.
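A correlation coefficient and the matching regression line can be obtained together, as in the sketch below. The data is simulated with a built-in negative slope (the names and coefficients are assumptions, not the wine measurements), and base graphics stand in for whichever plotting package the report used.

```r
set.seed(7)
n <- 300

# Simulated stand-in with a negative relationship; NOT the wine data.
sugar_demo   <- runif(n, min = 1, max = 20)
alcohol_demo <- 11.5 - 0.1 * sugar_demo + rnorm(n, sd = 1)

# Pearson correlation between the two variables.
r <- cor(alcohol_demo, sugar_demo)

# Simple linear regression of alcohol on residual sugar.
fit <- lm(alcohol_demo ~ sugar_demo)

# Scatter plot with the fitted regression line on top.
plot(sugar_demo, alcohol_demo,
     xlab = "Residual sugar (g / dm^3)", ylab = "Alcohol (% by volume)")
abline(fit, col = "red")
```

In a simple regression the slope always has the same sign as the correlation coefficient, so a downward line is the visual counterpart of a negative r.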

Alcohol and Total Sulfur Dioxide

## [1] -0.4488921

Alcohol and total sulfur dioxide are also negatively correlated at -0.4489. In the graphs above, I limited the y-axis to 350 mg / dm^3, excluding one data point.

Alcohol and Density

## [1] -0.7801376

The graph above excludes the data point at 1.04 density.

The strongest correlation with alcohol is again density, at -0.7801. Again I am going to go with the assumption that density is a consequence of alcohol and sugar levels (more alcohol lowers it, more sugar raises it) and not a cause of high quality in wines.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Our main variable of interest was quality. We can see that quality has the strongest correlations with chlorides, density, and alcohol, alcohol being the strongest link of all. It is also the only positively correlated variable, chlorides and density being negatively correlated with quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Because the link between quality and alcohol was so strong, I also analyzed the biggest drivers of alcohol levels, mainly residual sugar, total sulfur dioxide, and density. Again the strongest link was between alcohol and density, but I’d interpret this as a correlation and not a causation.

What was the strongest relationship you found?

It was between density and residual sugar, with a correlation of 0.839.

Multivariate Plots Section

While our initial dataset did not have any categorical variable, we created one - rating - in the previous sections. Another way to gain further insights into our dataset would be to convert some of our numerical features into categorical variables.

I’ll focus on the 3 biggest drivers of quality: chlorides, density, and alcohol, and the 2nd strongest correlation with alcohol apart from density: residual sugar.

I’ll split all those variables at their quartile values, except for residual sugar, which was shown to have a bimodal distribution. I’ll split this one at its median.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390
## Length  Class   Mode 
##      0   NULL   NULL
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

We also removed the rows with NA values, to reach 4893 observations.
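Splitting a numeric variable at its quartiles can be sketched with cut, using the chlorides quartiles printed above as breaks. The demo values and label names are assumptions for illustration. The sketch also shows a general property of cut: a value equal to the lowest break becomes NA unless include.lowest = TRUE is set. Whether that is the source of the NA rows removed here is not known.

```r
# A few illustrative chlorides values, including the dataset minimum and maximum.
chlorides_demo <- c(0.009, 0.036, 0.040, 0.043, 0.050, 0.346)

# Min, 1st quartile, median, 3rd quartile and max from the summary above.
breaks <- c(0.009, 0.036, 0.043, 0.050, 0.346)
labels <- c("Q1", "Q2", "Q3", "Q4")  # hypothetical label names

# include.lowest = TRUE keeps the minimum value in the first bin.
with_lowest <- cut(chlorides_demo, breaks = breaks, labels = labels,
                   include.lowest = TRUE)

# With the default, the minimum falls outside (0.009, 0.036] and becomes NA.
without_lowest <- cut(chlorides_demo, breaks = breaks, labels = labels)

# Median split for residual sugar (median = 5.2 g / dm^3 in the summary above).
sugar_demo  <- c(1.6, 5.2, 6.9, 20.7)
sugar_group <- ifelse(sugar_demo <= 5.2, "low sugar", "high sugar")
```

Counting NA values right after the split, for instance with sum(is.na(...)), makes it easy to see how many rows would be lost.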

Chlorides, Density, and Alcohol

We can clearly see how alcohol levels impact density, while chlorides levels do not have that same impact. The correlation between alcohol and chlorides is getting stronger and stronger as density decreases.

Chlorides, Density, and Alcohol compared to Quality

This chart illustrates further how those three variables relate to quality: high quality wines show higher alcohol levels, and lower density and chlorides amounts, than the average and low quality wines. Interestingly, lower quality wines also have on average lower levels of chlorides than average quality wines, which seems to indicate that after a certain threshold it is not such a big factor of quality anymore.

Density, Residual Sugar, Total Sulfur Dioxide, Alcohol

The biggest correlations with density are residual sugar (0.839), total sulfur dioxide (0.53), and alcohol (-0.78).

First, the relationship between residual sugar and total sulfur dioxide, with alcohol as a colour (limiting the x-axis - residual sugar - to 30 g / dm^3 to exclude the outlier, the sweet wine):

Alcohol levels seem to become lower and lower as sugar amounts increase.

We saw previously that the residual sugar distribution was bimodal; separating the variable by the median (so that we have the two “peaks” isolated), we see here that the density and total sulfur dioxide levels are also shifted to the right and up in the chart on the right: wines with higher residual sugar levels also have higher density (which is not surprising since the two are strongly correlated) and total sulfur dioxide levels.

Density, Residual Sugar and Total Sulfur Dioxide compared to Quality

Density, sugar and total sulfur dioxide are also the most impactful variables for alcohol, which is the most impactful variable for quality (we’ll use rating here).

We see even more clearly the relationship between density and residual sugar by looking at the colors of the chart: the darker color (higher residual sugar) grouped on the right of the graph with the x-axis for density. The distribution of total sulfur dioxide also seems more grouped towards the bottom of the chart compared to the low quality sample.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Density and residual sugar seem to go hand in hand with one another, as do alcohol levels and quality. The impacts of chlorides and total sulfur dioxide are less easy to read, but are also visible in the charts.

Were there any interesting or surprising interactions between features?

Density is definitely a variable that is impacted by a few other factors, which is also why I think it comes up as so strongly correlated to quality. Chlorides and total sulfur dioxide are also showing surprising relationships.


Final Plots and Summary

Plot One

Description One

This boxplot shows the distribution of alcohol levels for the wines in scope by quality rating (0 = lowest rating possible, 10 = highest rating possible). We can see clearly that high quality wines have higher levels of alcohol than the low quality wines; alcohol is the biggest driver of quality for this dataset.

Plot Two

Description Two

The second biggest driver of quality in our dataset is density: the two variables are negatively correlated, with a coefficient of -0.307, as shown in the boxplots above. As explained previously, to me low density is not a cause of good quality but rather a consequence of something else, and it has a correlation of 0.839 with residual sugar amounts. This is illustrated in the second plot, with a clear regression line showing that density increases with residual sugar amounts.

Plot Three

Description Three

In addition to alcohol and density, chlorides is the third biggest driver of quality for white wines. We can see above how those three variables interact with each other in regard to quality (or rating).


Reflection

General thoughts

This project aimed to help me put the pieces of exploratory data analysis together, and it certainly was a great introduction to it. Starting from a dataset in a csv file, we had to clean, analyze and visualize the data in order to understand the relationship between different features of white wines, and most importantly quality. The fact that the dataset was already cleaned and formatted made it easier to get started and focus on the analysis / visualization part, while at the same time giving a complete view of those types of projects. From a technical standpoint, it also made me more comfortable with using R - whether it is looking up the libraries’ documentation, improving charts, or creating new variables from my dataset.

Struggles

For one thing, while having a dozen variables does not seem like much, it can become time-consuming to visualize each and every one of them before narrowing down to the few most important ones. The correlation matrix of the second section was definitely a great help for this, while the first section for univariate plots was more repetitive - plotting the distribution, looking at its shape or its outliers, analyzing its quartile values, and so on.

The second main struggle I encountered was choosing the types of plots that would best illustrate my analysis. In the first submission of this document for instance, I was doing boxplots with two numerical variables, and most of my visualizations were histograms and boxplots. I tried in this submission to give it more variety, with scatter plots or multivariate plots using regression lines for instance, even though there are still possible improvements to be made.

Successes

One success for me was to be able to make sense of both the distributions and the quartiles or correlation metrics, and build a story around this dataset that makes sense. Another one was to use R code chunks to improve my dataset, whether this was by creating categorical variables out of numerical ones, or creating a vector to label axes differently for instance.

Improvement ideas

As mentioned above, adding more diversity to the types of visualizations could be a big improvement to this analysis. Another one could be to analyze more features in the multivariate section, instead of just focusing on the three or four most important ones. A final improvement could be to try to build a linear model to predict quality based on multiple features, something I did not tackle in this project.


